
fix(store): scope FSM_SYNC_MODE=nosync to raft-apply callers only#600

Merged
bootjp merged 2 commits into main from perf/fsm-apply-group-commit
Apr 23, 2026

Conversation


@bootjp bootjp commented Apr 23, 2026

Summary

Follow-up to #592 addressing the codex P2 correctness finding: ELASTICKV_FSM_SYNC_MODE=nosync previously affected direct (non-raft) callers of store.ApplyMutations / store.DeletePrefixAt, which have no raft-log replay as a durability backstop. Splits the API so the knob only relaxes fsync on the raft-apply path.

  • ApplyMutations / DeletePrefixAt are now unconditionally pebble.Sync. Safe for any direct caller (catalog bootstrap, admin snapshots, migrations, tests).
  • ApplyMutationsRaft / DeletePrefixAtRaft are the new raft-apply entry points; they observe s.fsmApplyWriteOpts. Called only from kv/fsm.go.
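A minimal sketch of the split, using a stdlib-only stand-in for Pebble's write options (the real methods take a mutation batch and a `*pebble.WriteOptions`; everything here except the `fsmApplyWriteOpts` name is illustrative, not the actual store code):

```go
package main

import (
	"fmt"
	"os"
)

// writeOpts stands in for pebble's *pebble.WriteOptions:
// true ≈ pebble.Sync, false ≈ pebble.NoSync.
type writeOpts bool

type store struct {
	// Resolved once from ELASTICKV_FSM_SYNC_MODE at startup.
	fsmApplyWriteOpts writeOpts
}

// ApplyMutations is the direct entry point (catalog bootstrap, admin
// snapshots, migrations, tests). There is no raft log to replay after
// a crash, so it always fsyncs regardless of the knob.
func (s *store) ApplyMutations() writeOpts { return true }

// ApplyMutationsRaft is called only from the FSM apply loop, where the
// raft WAL fsync is the durability backstop, so NoSync is safe there.
func (s *store) ApplyMutationsRaft() writeOpts { return s.fsmApplyWriteOpts }

func main() {
	nosync := os.Getenv("ELASTICKV_FSM_SYNC_MODE") == "nosync"
	s := &store{fsmApplyWriteOpts: writeOpts(!nosync)}
	fmt.Println("direct sync:", s.ApplyMutations())
	fmt.Println("raft sync:", s.ApplyMutationsRaft())
}
```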

Code evidence for the bug

distribution/catalog.go:630 calls s.store.ApplyMutations directly from CatalogStore.applySaveMutations. main.go:749 wires distribution.NewCatalogStore around the raw raftGroupRuntime.store (a store.MVCCStore), so EnsureCatalogSnapshot -> CatalogStore.Save -> store.ApplyMutations is a production path that never goes through the raft apply loop. Under nosync, a crash before Pebble flush would lose acknowledged catalog writes with no raft entry to replay — exactly the failure mode codex P2 described.

Before / after per call site

| Call site | Before | After |
|---|---|---|
| `kv/fsm.go` (raft apply: raw / 1PC / prepare / commit / abort / DEL_PREFIX) | `s.fsmApplyWriteOpts` | `s.fsmApplyWriteOpts` (via `*Raft`) — unchanged |
| `distribution/catalog.go` `CatalogStore.applySaveMutations` | `s.fsmApplyWriteOpts` (UNSAFE) | `pebble.Sync` (always) |
| `adapter/*_test.go` direct `ApplyMutations` / `DeletePrefixAt` calls | `s.fsmApplyWriteOpts` | `pebble.Sync` (always) |
| `kv/shard_store.go`, `kv/leader_routed_store.go` wrappers | forwarded once | forwarded twice (both variants) |

Test plan

  • TestDirectApplyWriteOpts_AlwaysSync asserts the direct path resolves to pebble.Sync even when the FSM-apply mode is nosync.
  • TestDirectApplyMutations_NoSyncConfigured_StillWritesDurably exercises the public ApplyMutations / DeletePrefixAt entry points under a NoSync-configured store and confirms data is visible after a clean reopen.
  • Existing TestApplyMutations_NoSync* and BenchmarkApplyMutations_SyncMode renamed to *Raft to reflect that the knob now only governs the raft-apply path.
  • go test -race -count=1 ./... — all packages pass.
  • golangci-lint run ./... — 0 issues.

/gemini review
@codex review

Previously ApplyMutations/DeletePrefixAt used s.fsmApplyWriteOpts
unconditionally, which made ELASTICKV_FSM_SYNC_MODE=nosync affect direct
(non-raft) callers that do NOT have raft-log replay as a durability
backstop. Concrete production path:
distribution.EnsureCatalogSnapshot -> CatalogStore.Save ->
store.ApplyMutations. If the process crashed before Pebble flushed,
those acknowledged writes could be lost with no raft entry to re-apply.

This change splits the API:

- ApplyMutations / DeletePrefixAt: always pebble.Sync. Safe for any
  caller, including those that bypass raft (catalog bootstrap, admin
  snapshots, migrations, tests).
- ApplyMutationsRaft / DeletePrefixAtRaft: governed by
  s.fsmApplyWriteOpts (ELASTICKV_FSM_SYNC_MODE). Intended solely for the
  FSM apply loop; only the raft WAL fsync makes pebble.NoSync safe.

kv/fsm.go is the only caller of the new *Raft methods. CatalogStore and
every adapter test retain the always-sync ApplyMutations, so the
nosync opt-in can no longer silently weaken durability outside the
raft-apply path. MVCCStore / ShardStore / LeaderRoutedStore all
implement both variants; the in-memory mvccStore has no WAL so both
delegate to the same body.
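The wrapper forwarding can be sketched with a widened interface; the type names mirror the PR (MVCCStore, ShardStore) but the bodies are illustrative stand-ins, not the repo's code:

```go
package main

import "fmt"

// mutator sketches the widened store interface: both the always-sync
// direct variant and the knob-governed raft variant.
type mutator interface {
	ApplyMutations() string     // always durable
	ApplyMutationsRaft() string // governed by ELASTICKV_FSM_SYNC_MODE
}

// mvccStore models the in-memory store: it has no WAL, so both
// variants delegate to the same body.
type mvccStore struct{}

func (mvccStore) ApplyMutations() string     { return "applied" }
func (mvccStore) ApplyMutationsRaft() string { return "applied" }

// shardStore forwards both variants to the underlying store,
// preserving which entry point the caller chose.
type shardStore struct{ inner mutator }

func (s shardStore) ApplyMutations() string     { return s.inner.ApplyMutations() }
func (s shardStore) ApplyMutationsRaft() string { return s.inner.ApplyMutationsRaft() }

func main() {
	var m mutator = shardStore{inner: mvccStore{}}
	fmt.Println(m.ApplyMutations(), m.ApplyMutationsRaft())
}
```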

Tests:
- Added TestDirectApplyWriteOpts_AlwaysSync asserting the direct path
  always resolves to pebble.Sync even when the FSM-apply mode is
  nosync.
- Added TestDirectApplyMutations_NoSyncConfigured_StillWritesDurably
  for functional coverage of the public entry points under a
  NoSync-configured store.
- Existing sync-mode tests + benchmark renamed to *Raft to reflect
  that the knob now only governs the raft-apply path.

Addresses codex P2 on #592.

@coderabbitai
coderabbitai Bot commented Apr 23, 2026


📥 Commits

Reviewing files that changed from the base of the PR and between 0fb7db7 and 343e870.

📒 Files selected for processing (8)
  • kv/fsm.go
  • kv/leader_routed_store.go
  • kv/shard_store.go
  • store/lsm_store.go
  • store/lsm_store_sync_mode_benchmark_test.go
  • store/lsm_store_sync_mode_test.go
  • store/mvcc_store.go
  • store/store.go

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Can't wait for the next one!


@bootjp bootjp enabled auto-merge April 23, 2026 17:37
@bootjp bootjp merged commit d9afba2 into main Apr 23, 2026
8 checks passed
@bootjp bootjp deleted the perf/fsm-apply-group-commit branch April 23, 2026 17:43
bootjp added a commit that referenced this pull request Apr 25, 2026
…ni medium)

Address two PR #636 review concerns:

Codex P2 — `4 × 1 MiB = 4 MiB` exactly is *over* `MaxSizePerMsg` once
protobuf framing overhead is counted. Each pb.Mutation carries the
Op tag, Key tag + bytes, and Value length prefix; the pb.Request
envelope wraps them; marshalRaftCommand prepends one byte. With a
normal-length objectKey the encoded entry is ~4 MiB + 300 B, with a
1 KiB objectKey it grows to ~4 MiB + 4 KiB. Either way the entry
falls into etcd/raft's util.go:limitSize oversized-first-entry path
(the documented "as an exception" branch), bypassing the
`MaxInflight × MaxSizePerMsg` per-peer memory bound the prior commit
was trying to enforce.

Fix: lower s3ChunkBatchOps from 4 to 3. `3 × 1 MiB` encodes to 3 MiB
plus a few hundred bytes, leaving ~1 MiB of headroom even for
kilobyte-scale keys, so the entry stays strictly under MaxSizePerMsg
and the batched-MsgApp path applies. Per-PUT Raft commit count grows
~5× from the pre-PR-#636 baseline (5 GiB PUT: 320 → ~1707 entries),
but each fsync is ~5× smaller; the WAL group commit that landed in
PR #600 absorbs the rate.

Gemini medium — the previous commit also forced the cleanup paths
(cleanupPartBlobsAsync, deleteByPrefix, cleanupManifestBlobs,
appendPartBlobKeys) onto the same 4-op batch even though they carry
no chunk payload. Those ops are pure-key Dels; a 4-op batch
amplifies cleanup latency for large objects (5 GiB cleanup → ~1707
batches at the data-write size).

Fix: introduce s3MetaBatchOps = 64. With BlobKey-shaped keys
~100 B each, 64 keys × 100 B ≈ 6 KiB per batch, still three orders
of magnitude under MaxSizePerMsg. A 5 GiB-object cleanup commits
~80 batches instead of ~1707, so orphaned-blob garbage collection
finishes proportionally faster and does not amplify Raft load
relative to the data-write path. Updates the cleanup call sites
(adapter/s3.go:1879, 1900, 1933, 2206, 2227) plus the appendPartBlobKeys
loop helper.

Tests:

- TestS3ChunkBatchFitsInRaftMaxSize builds a worst-case PutObject
  batch (1 KiB objectKey, fully-populated 1 MiB chunk values),
  marshals via the same protobuf path the real Dispatch uses, and
  asserts the encoded entry plus the marshalRaftCommand 1-byte
  framing prefix is strictly under MaxSizePerMsg = 4 MiB. A second
  assertion pins the headroom margin at >= 64 KiB so a future bump
  in s3ChunkBatchOps or s3ChunkSize is caught at PR time.
- TestS3MetaBatchFitsInRaftMaxSize is the equivalent invariant for
  s3MetaBatchOps on the cleanup path.

Build / vet / lint clean. S3 unit tests pass.
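The headroom arithmetic can be sanity-checked with a rough size model; the per-mutation framing overhead below is an assumed round figure, not the measured protobuf encoding (the in-repo test marshals the real thing):

```go
package main

import "fmt"

// Rough size model for one raft entry carrying an S3 chunk batch.
// Constants mirror the values discussed in the commit message; the
// perMutOverhead figure is an illustrative assumption.
const (
	maxSizePerMsg   = 4 << 20 // 4 MiB (PR #593)
	s3ChunkSize     = 1 << 20 // 1 MiB chunk values
	s3ChunkBatchOps = 3       // lowered from 4 in this commit
	objectKeyLen    = 1024    // worst-case 1 KiB object key
	perMutOverhead  = 64      // assumed tag/length framing per pb.Mutation
	envelopeBytes   = 1       // marshalRaftCommand prefix byte
)

func main() {
	entry := envelopeBytes +
		s3ChunkBatchOps*(s3ChunkSize+objectKeyLen+perMutOverhead)
	headroom := maxSizePerMsg - entry
	fmt.Printf("entry=%d headroom=%d\n", entry, headroom)
	// With batchOps=3 the entry stays well under MaxSizePerMsg, so
	// etcd/raft's oversized-first-entry exception never triggers.
	fmt.Println("fits:", entry < maxSizePerMsg)
}
```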
bootjp added a commit that referenced this pull request Apr 25, 2026
…s=4 (#636)

## Summary

Aligns the S3 PutObject Raft entry size with `MaxSizePerMsg` (4 MiB
since PR #593).

**Change:** `s3ChunkBatchOps = 16 → 4`, so that one Raft entry =
`s3ChunkBatchOps × s3ChunkSize ≈ 4 MiB`. A one-line diff.

## Why

`etcd/raft`'s `util.go:limitSize` has a documented exception: a single
entry that exceeds `MaxSizePerMsg` is not rejected but is carried alone
in its own MsgApp (`"if the size of the first entry exceeds maxSize, a
non-empty slice with just this entry is returned"`). A 16 MiB entry
slips straight through on this path, so:

| Item | s3ChunkBatchOps=16 | s3ChunkBatchOps=4 (this PR) |
|---|---|---|
| 1 Raft entry size | ~16 MiB | ~4 MiB |
| Leader worst-case / peer | 1024 × 16 MiB = **16 GiB** | 1024 × 4 MiB = **4 GiB** |
| 3-node cluster (leader) | 32 GiB | 8 GiB |
| WAL fsync per entry | 16 MiB | 4 MiB |
| Raft commits for a 5 GiB PUT | 320 | 1280 (4×) |

The `1024 × 4 MiB = 4 GiB / peer` memory bound claimed in PR #593's
description assumed small entries and did not hold on the S3 path.
With this PR, S3 rides the normal batched-MsgApp path and the bound
becomes accurate even with S3 traffic included.

The Raft commit count quadruples, but each fsync is 4× smaller, so
this should be offset by the WAL group commit from PR #600 and
end-to-end throughput should be roughly unchanged.

## Test plan

- [x] `go build ./...` clean
- [x] `golangci-lint run ./...` 0 issues
- [x] `go test ./adapter/ -short -run 'TestS3|S3Server'` pass
- [ ] CI

## Follow-ups (design docs in separate PRs)

1. **S3 admission control** — hard-cap the bytes of concurrently
in-flight S3 PUT bodies so that leader-side memory stays under
`4 GiB × peers` even as client parallelism grows.
2. **Raft snapshot strategy / blob bypass** — a 5 GiB blob riding
along on a snapshot transfer when a follower falls behind is
undesirable, so move blobs off the Raft path and replicate only the
manifest through raft.
/gemini review
@codex review
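The oversized-entry exception can be sketched with a simplified stand-in for etcd/raft's `limitSize` (sizes in bytes; the real function operates on `[]raftpb.Entry` and counts encoded sizes):

```go
package main

import "fmt"

// limitSize mimics the documented etcd/raft behavior: entries are
// packed into one message up to maxSize, but if even the FIRST entry
// exceeds maxSize it is returned alone rather than rejected.
func limitSize(sizes []uint64, maxSize uint64) []uint64 {
	if len(sizes) == 0 {
		return nil
	}
	total := sizes[0]
	n := 1 // the first entry is always included ("as an exception")
	for _, s := range sizes[1:] {
		if total+s > maxSize {
			break
		}
		total += s
		n++
	}
	return sizes[:n]
}

func main() {
	const MiB = 1 << 20
	// s3ChunkBatchOps=16: each entry is ~16 MiB, over the 4 MiB cap,
	// so every MsgApp carries exactly one oversized entry.
	fmt.Println(len(limitSize([]uint64{16 * MiB, 16 * MiB}, 4*MiB)))
	// s3ChunkBatchOps=4: ~4 MiB entries ride the normal batched path,
	// and smaller entries pack together under the cap.
	fmt.Println(len(limitSize([]uint64{1 * MiB, 1 * MiB, 1 * MiB}, 4*MiB)))
}
```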
bootjp added a commit that referenced this pull request Apr 25, 2026
## Summary

As a follow-up to PR #636 (aligning Raft entries with
`MaxSizePerMsg = 4 MiB` via `s3ChunkBatchOps = 4`), this submits two
design docs. Docs only, no code.

### 1. `docs/design/2026_04_25_proposed_s3_admission_control.md`

**Problem**: PR #636 fixes per-entry memory accounting, but the
aggregate (= concurrent PUT count × in-flight bytes per PUT) is
unbounded as client parallelism grows.
**Proposal**: **hard-cap, behind a shared semaphore**, the S3 PUT body
bytes a node accepts before committing to Raft.
- (A) pre-charge the request entry from `Content-Length`
- (B) sub-lease per 4 MiB batch in `flushBatch` —
answer 503 SlowDown once `dispatchAdmissionTimeout` elapses
- default 256 MiB cap (~14% of GOMEMLIMIT=1800 MiB), tunable via env
- 503 with `Retry-After` → the AWS SDK retries automatically
- staged rollout M1–M4, including metrics / Grafana panels

### 2. `docs/design/2026_04_25_proposed_s3_raft_blob_offload.md`

**Problem**: a 5 GiB PUT is written wholesale into the WAL, so
follower catch-up and snapshot transfer scale O(stored bytes).
**Proposal**: Raft carries only `ChunkRef{SHA256, Size}` (~100 B) plus
the manifest. Chunk bodies are written directly to local Pebble under
a separate key namespace `!s3|chunkblob|<SHA>`; when a follower sees a
`chunkref` in applyLoop, it **fetches the body from a peer via an
asynchronous fetch RPC**.
- Raft load for a 1 GiB PUT: compressed to ~100 KiB (a 10⁴× reduction)
- Snapshot size becomes O(manifest count)
- GC via reference counts plus a grace period
- opt-in via `ELASTICKV_S3_BLOB_OFFLOAD=true`; the legacy `BlobKey`
path keeps running alongside
- five-stage rollout from an M0 spike to an M4 migrator

Both docs:
- follow the filename / header conventions of `docs/design/README.md`
(Status: Proposed / Date: 2026-04-25)
- cross-reference the related PRs (#593, #600, #612, #617, #636) and
existing designs (workload isolation, raft grpc streaming)
- state risks and open questions explicitly

## Test plan

- [x] markdown lint clean (no textlint configured; checked manually)
- [x] filenames follow the `YYYY_MM_DD_proposed_<slug>.md` convention
- [x] links to existing designs consistent with docs/design/README.md

Review focus on the design direction (the admission-control cap value,
the blob-offload content-addressing unit, and the M0–M4 milestone
split) would be most appreciated.

/gemini review
@codex review


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Documentation
* Added design proposal for S3 PUT request admission control and load
management strategy
* Added design proposal for S3 object storage optimization approach

<!-- end of auto-generated comment: release notes by coderabbit.ai -->